Abstract

Multidimensional transforms have widespread applications in computer
vision, pattern analysis and image processing. The only existing
optimal architecture for computing multidimensional DFT on data of
size n = N^d requires very large rotator units of area O(n^2) and
pipeline time O(log n). In this paper we propose a family of optimal
architectures with area-time trade-offs for computing multidimensional
transforms. The large rotator unit is replaced by a combination of a
small rotator unit, a transpose unit and a block rotator unit. The
combination has an area of O(N^{d+2a}) and a pipeline time of
O(N^{d/2-a} log n), for 0 < a ≤ d/2. We apply this scheme to design
optimal architectures for two-dimensional DFT, DHT and DCT. The
computation is made efficient by mapping each of the one-dimensional
transforms involved into two dimensions.


Supported  in  part  by  NSA  Contract.  No.  MDA-904-85II-0015,  NSF  Grant  No.  DClt- 
1  Introduction


Multidimensional transforms are a powerful tool for analyzing multidimensional
signals. The 2-D Discrete Fourier Transform (DFT) is widely used in spectrum
analysis, speech processing and image processing. The 3-D and 4-D DFTs are used to
represent dynamic patterns in computer vision and pattern analysis. Since the
number of computations involved in such transforms is very large, optimal
architectures with efficient computational schemes are needed.

There exists an architecture [GS] for computing multidimensional DFT on a
single file of n = N^d elements whose AT^2 performance achieves the known lower
bound, where A is the area and T_p is the pipeline time. The design consists of
N^{d-1} DFT(N) computation units and a rotator of size O(n^2) for data permutation.
In this paper we show that if the input is in the form of a 2-D array of size
N^{d/2+a} x N^{d/2-a}, 0 < a ≤ d/2, then we can design a family of optimal
architectures for different values of a. The design of [GS] is a member of this
family when a = d/2. There are two cases depending on whether A) a is an integer
or B) a is not an integer but N^a is an integer. Our design for Case A consists of
N^{d/2+a-1} DFT(N) arrays, a transpose unit of size O(N^d log n) and a rotator of
size O(N^{d+2a}). Case B, which is slightly more complicated, requires an additional
block rotator unit of size O(N^{d+2a}). The maximum pipeline time for both cases
is O(N^{d/2-a} log n). Thus for low values of a the area would be small, whereas
for large values of a the pipeline time would be small. In addition, all these
designs satisfy the AT^2 lower bound.

The rest of the paper is organized as follows. In Section 2 we state the definition
and lower bounds for the computation of multidimensional linear transforms.
Non-optimal and optimal architectures that exist in the literature are briefly
discussed in Section 3. Section 4 deals with the family of optimal architectures
for different values of a. In Section 5 we propose optimal as well as efficient
designs for computing 2-D DFT, Discrete Hartley Transform (DHT) and Discrete
Cosine Transform (DCT). Section 6 summarizes the whole paper.


2  Preliminaries 

Let n be the total number of data elements that are to be organized in a
d-dimensional data cube. Each element has to be represented by d indexes
n_1, n_2, ..., n_d. Any d-dimensional linear transform can be defined by

\[
X(k_1, k_2, \ldots, k_d) = \sum_{n_d} \cdots \sum_{n_2} \sum_{n_1}
  x(n_1, n_2, \ldots, n_d)\, a_1(n_1, k_1)\, a_2(n_2, k_2) \cdots a_d(n_d, k_d)
\tag{1}
\]

where the a_i are the transform functions, 0 ≤ k_i, n_i ≤ N_i - 1 for 1 ≤ i ≤ d and
n = N_1 N_2 ... N_d. For instance, a_i(n_i, k_i) = exp(-j (2π/N_i) n_i k_i),
1 ≤ i ≤ d, for the d-dimensional DFT. In order to simplify our analysis, we assume

1. N_1 = N_2 = ... = N_d = N and n = N^d

2. N is a power of 2, that is, N = 2^m.

We proceed to state, along the lines of [GS], the lower bound on AT^2 for
multidimensional DFT, DHT and DCT. The pipeline time T_p is used to describe the
time performance of any circuit. We assume that if n is the problem size, then
O(log n) bits are sufficient to represent the value of a variable. Vuillemin [Vu]
has shown that AT^2 = Ω(I^2) for any chip computing a transitive problem of size I.
For multidimensional DFT, DHT and DCT, the information content of a problem I is
Ω(n log n). Thus the lower bound of AT^2 is Ω(n^2 log^2 n).
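Equation (1) can be checked numerically on a small cube. The following is only a sketch under stated assumptions: the helper names (dft_kernel, direct_transform, separable_transform) and the dictionary-based storage are illustrative and not part of the paper. It evaluates (1) literally and then as d successive passes of 1-D transforms, one pass per index, which is the decomposition the architectures in the later sections rely on.

```python
import cmath
from itertools import product


def dft_kernel(n_i, k_i, N):
    """DFT transform function a_i(n_i, k_i) = exp(-j * 2*pi * n_i * k_i / N)."""
    return cmath.exp(-2j * cmath.pi * n_i * k_i / N)


def direct_transform(x, N, d, kernel=dft_kernel):
    """Evaluate equation (1) literally: a d-fold sum for every output index."""
    X = {}
    for k in product(range(N), repeat=d):
        total = 0j
        for n in product(range(N), repeat=d):
            w = 1 + 0j
            for n_i, k_i in zip(n, k):
                w *= kernel(n_i, k_i, N)
            total += x[n] * w
        X[k] = total
    return X


def separable_transform(x, N, d, kernel=dft_kernel):
    """Same result, computed as d passes of 1-D transforms, one per index."""
    cur = dict(x)
    for axis in range(d):
        nxt = {}
        for idx in product(range(N), repeat=d):
            k_i = idx[axis]  # position `axis` now holds an output index
            acc = 0j
            for n_i in range(N):
                src = idx[:axis] + (n_i,) + idx[axis + 1:]
                acc += cur[src] * kernel(n_i, k_i, N)
            nxt[idx] = acc
        cur = nxt
    return cur
```

For N = 2 and d = 3 the two functions agree to rounding error. Passing a different kernel function gives other separable linear transforms of the same form.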

3  Existing  architectures 

In this section we briefly discuss the various optimal and non-optimal
architectures for transforms with dimension d ≥ 2. There exist schemes for
computing 1-D as well as multidimensional DFT in the literature. However, there
are no schemes for computing multidimensional DHT and DCT. Once we know how to
optimally compute 1-D DHT and DCT, optimal computation of multidimensional DHT and
DCT would be along the same lines as that of multidimensional DFT.

The schemes for computing 2-D DFT are based on computing 1-D DFT on
columns followed by 1-D DFT computation on rows. These schemes require either
a separate array transpose unit [Ch], [OJ] or an internal scheme for transposing the
data [Zh]. Chowdhury et al. [Ch] have designed a RAM array transposer
(RAMAT) which uses N^2 RAM cells of size O(log n) bits each. The area of this unit
is O(N^2 log n) and this dominates the area of the design. By using efficient DFT(N)
circuits which require O(N log n) time, AT^2 of the design becomes O(N^4 log^3 n). This
is O(log n) away from the optimal. The scheme is illustrated in Fig. 1.

Fig. 1  Architecture using RAMAT


In another design [OJ], the RAM cells are replaced by O(log n)-bit shift registers. The
data movement is regulated by control inputs, the generation of which is complicated.
We have improved upon the design of [OJ] by simplifying the control considerably.
For all these array transposer designs, the minimum area possible is O(N^2 log n).
Zhang [Zh] has designed a mesh-connected systolic array, each cell of which is capable
of computing DFT in the vertical as well as in the horizontal direction. AT^2 for this
design is also O(N^4 log^3 n). Bilardi [BS] was the first one to come up with an optimal
DFT circuit. The scheme consists of orthogonal tree networks (OT) for the row and
column computations and a transposer of size O(n^2). An overall time complexity of
O(log n) guarantees the AT^2 lower bound. DFT(n) can be computed optimally [PV]
for a large range of T, T ∈ [Ω(log^2 n), O(sqrt(n log n))], if interconnections based on
cube connected cycles are used. However, the minimum computation time of
T = O(log n) cannot be achieved by this scheme. Bilardi [BS] has proposed optimal
schemes for computing 2-D DFT for any T in the range [Ω(log n), O(sqrt(n log n))]. This
is possible by organizing the input in the form of a 2-D array with 's' wavefronts and
'n/s' input lines. The pipelined transposer unit has an area of O(n^2/s^2 + n log n) and
a time complexity of O(s log n); the area can be approximated to O(n^2/s^2) for
1 ≤ s ≤ sqrt(n / log n). All designs in this range of 's' achieve the AT^2 lower bound.

Not much work has been done in the field of VLSI architectures for
multidimensional transforms. Gertner and Shamash [GS] have recently proposed an
optimal architecture for multidimensional Fourier transforms. Their design, as shown
in Fig. 2, consists of N^{d-1} arrays for computing 1-D DFT(N) and a rotation network
array or rotator for permuting the data. We shall next discuss the theory and layout
of the rotator. The rotator is based on a network which performs rotation over an
index in one step. Thus if the data is represented by x(n_1, n_2, ..., n_d), then after
passing through the rotator once it appears as x(n_2, ..., n_d, n_1) and after passing
through it once more it appears as x(n_3, n_4, ..., n_2). Since O(log n) bits are used to
represent the data, every rotation is equivalent to (log n)/d cyclic shifts of its binary
representation. The data enter the rotator in n rows. They are then rotated with the
help of load-shift cells placed at the connection of a row input and its column
necklace. A necklace is defined as a collection of nodes which are formed on rotation
of the indexes. For example, for d = 3 and log n = 6, the necklace generated by 000011
is 000011 - 001100 - 110000. Fig. 3 illustrates a rotator for n = 16 and d = 2. The
number of columns required to lay out the rotator is O(n). The area complexity is
O(n^2) and the pipeline time complexity is O(log n). Thus this scheme achieves the
AT^2 lower bound.

Fig. 2  Architecture using rotator

Fig. 3  Layout of 4 x 4 2-D DFT
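The index rotation and the necklace definition above can be illustrated in a few lines. This is a sketch, not the rotator circuit of [GS]: the function names are hypothetical, and it assumes that n_1 occupies the most significant (log n)/d bits of the index, so that moving n_1 to the end is a cyclic left shift.

```python
def rotate_index(index, total_bits, d):
    """Cyclically left-shift a total_bits-wide index by total_bits // d bits.

    With n_1 in the most significant bits (an assumption of this sketch),
    this maps (n_1, n_2, ..., n_d) to (n_2, ..., n_d, n_1).
    """
    if total_bits % d != 0:
        raise ValueError("total_bits must be a multiple of d")
    m = total_bits // d
    mask = (1 << total_bits) - 1
    return ((index << m) | (index >> (total_bits - m))) & mask


def necklace(index, total_bits, d):
    """Collect the distinct indexes reached by repeated rotation of `index`."""
    out = [index]
    cur = rotate_index(index, total_bits, d)
    while cur != index:
        out.append(cur)
        cur = rotate_index(cur, total_bits, d)
    return out


if __name__ == "__main__":
    # The example from the text: d = 3, log n = 6, starting from 000011.
    print([format(v, "06b") for v in necklace(0b000011, 6, 3)])
```

For d = 3 and log n = 6 this reproduces the necklace 000011 - 001100 - 110000 quoted in the text; an index such as 000000 forms a necklace with itself.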

The large size of the rotator unit and the large number of 1-D DFT(N)
arrays make the design of [GS] unattractive. For moderately large n, the size of
the interconnection would be huge. This would also mean complicated inter-chip
and intra-chip connections. We can reduce the area of the design drastically at the
expense of a larger pipeline time and still achieve the AT^2 lower bound. In the next
section we will show that there exists a family of optimal architectures, of which the
design of [GS] is a member.

4  Family  of  optimal  architectures 

In the design by Gertner and Shamash [GS], a single file of n = N^d elements
is processed by N^{d-1} DFT(N) circuits. The data is then rotated and fed back to the
DFT computation block (see Fig. 2). This process is repeated d times. In our design
the data is organized in a 2-D array with N^{d/2+a} rows and N^{d/2-a} columns as shown
in Fig. 4. The bounds of a are 0 < a ≤ d/2. Without loss of generality we assume
that d is divisible by 2. When a = d/2, the 2-D array collapses into a single column
of N^d elements. This is the case investigated by [GS]. When d is not divisible by 2,
the data can be organized into a 2-D array with N^{ceil(d/2)+a} rows and
N^{floor(d/2)-a} columns. The advantage of having a 2-D data block is that the same
DFT computation block can be used to compute DFT of multiple columns.

Fig. 4  2-D data organization of n = N^d elements

We shall investigate two cases here:

A. When a is an integer

B. When a is not an integer but N^a is an integer.
Case  A  : 


In this case DFT(N) is computed d/2 + a times along the columns followed by d/2 - a
times along the rows. The number of DFT(N) circuits required is N^{d/2+a-1} compared
to N^{d-1} of [GS]. The data is passed through a rotator unit ROT1. The theory and
layout of this unit are very similar to that of the rotator of [GS] discussed in Sec. 3.
Since the number of input lines to the rotator is N^{d/2+a}, the size of the rotator is
O(N^{d+2a}). The data is circulated d/2 + a times through ROT1, transposed and then
circulated d/2 - a times through ROT2. ROT2 is a rotator unit very similar to ROT1
but with N^{d/2-a} input lines. Note that ROT2 can be eliminated if the transpose unit
is constructed such that ROT2 has the same number of input lines as ROT1.


Fig. 5  Data  organization  before  and  after  transpose 


The procedure is as follows. We can think of the data block as consisting of N^{2a}
subblocks, each of size N^{d/2-a} x N^{d/2-a}. Thus if we transpose the subblocks in
parallel, the size of the data block remains the same. In that case there is no need
for a second rotator unit. The transpose unit consists of N^{2a} subunits. Each
transpose subunit operates on a data subblock. Since the transpose subunits operate
in parallel, T_p for the transpose unit is O(N^{d/2-a} log n). The total area of the
transpose unit is O(N^d log n). Fig. 5 shows the data blocks before and after the
transpose. We claim that we have not lost any information by transposing the data in
subblocks. The reason is as follows. DFT(N) has to be computed d/2 - a times over
each row (see Fig. 4). By transposing a subblock of size N^{d/2-a} x N^{d/2-a}, each row
of length N^{d/2-a} is converted into a column of length N^{d/2-a}. Thus the
consecutiveness of the elements in a row gets converted into the consecutiveness of
the elements in a column. Thus computation of DFT(N) d/2 - a times over each column
of the transposed data block gives the correct result. The block diagram for our
design for Case A is shown in Fig. 6.
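The claim that subblock-wise transposition keeps the data block the same size while turning rows into columns can be illustrated as follows. This is a sketch under assumptions: the block is modelled as a Python list of rows, the subblocks are taken to be stacked vertically, and the function name is hypothetical. Here c stands for the subblock side N^{d/2-a}.

```python
def transpose_subblocks(block, c):
    """Transpose every c x c subblock of a (B*c) x c array, subblock by subblock.

    The subblocks are assumed to be stacked vertically, so the array keeps
    its shape: B*c rows of length c before and after.
    """
    rows = len(block)
    if rows % c != 0 or any(len(r) != c for r in block):
        raise ValueError("block must have a multiple of c rows, each of length c")
    out = []
    for top in range(0, rows, c):
        sub = block[top:top + c]
        # Row r of the subblock becomes column r of the transposed subblock.
        out.extend([sub[r][col] for r in range(c)] for col in range(c))
    return out


if __name__ == "__main__":
    # N = 2, d = 4, a = 1: 8 rows, 2 columns, N^{2a} = 4 subblocks of size 2 x 2.
    block = [[(r, col) for col in range(2)] for r in range(8)]
    for row in transpose_subblocks(block, 2):
        print(row)
```

After the call, the elements that were consecutive along a row of a subblock are consecutive down a column of the same subblock, which is the property the text relies on when the row transforms are computed with the column hardware.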


Fig. 6  Architecture  for  Case  A 

The area complexity of the input MUX and of the output DMUX is O(N^{d/2+a}).
If we use the scheme of [OJ] in the computation of DFT(N), the area of the DFT(N)
computation block would be N^{d/2+a-1} x O(N^2) = O(N^{d/2+a+1}). The area of ROT1
is O(N^{d+2a}) and that of the transpose unit is O(N^d log n). The maximum pipeline
time for this design is the time taken to transpose the data, T_pmax = O(N^{d/2-a} log n).
In order that the design be optimal, A_max has to be O(N^{d+2a}). Thus our design is
optimal for those values of a for which N^d log n ≤ N^{d+2a}. This means that there is
an additional constraint of a ≥ (1/2) log_N(log n). For all practical cases,
(1/2) log_N(log n) < 1. Thus for all values of a in the range 1 ≤ a ≤ d/2, we can design
d-dimensional DFTs optimally.
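The area and time terms above can be tabulated for concrete parameters. The sketch below is illustrative only: it ignores all constant factors, assumes logarithms to base 2, writes N = 2^m, and uses hypothetical function names. It works with log2 of each quantity so that fractional values of a can be handled exactly.

```python
import math
from fractions import Fraction


def case_a_costs(m, d, a):
    """Return log2 of each area term and of T_pmax for N = 2**m (constants ignored)."""
    a = Fraction(a)
    log_n = m * d                      # log2(n) for n = N**d
    loglog = math.log2(log_n)          # log2(log2(n))
    area = {
        "mux": m * (Fraction(d, 2) + a),            # N^{d/2+a}
        "dft_block": m * (Fraction(d, 2) + a + 1),  # N^{d/2+a+1}
        "rot1": m * (d + 2 * a),                    # N^{d+2a}
        "transpose": m * d + loglog,                # N^d log n
    }
    time = m * (Fraction(d, 2) - a) + loglog        # N^{d/2-a} log n
    return area, time


def at2_gap(m, d, a):
    """log2 of (A_max * T_pmax**2) / (n**2 * log2(n)**2); 0 means the bound is met."""
    area, time = case_a_costs(m, d, a)
    a_max = max(float(v) for v in area.values())
    bound = 2 * m * d + 2 * math.log2(m * d)
    return a_max + 2 * float(time) - bound


if __name__ == "__main__":
    for a in (Fraction(1, 4), Fraction(1, 2), 1, 2):
        print(a, at2_gap(4, 4, a))
```

For N = 16 and d = 4 the gap is zero for a = 1/2, 1 and 2, and positive for a = 1/4, where the transpose unit rather than ROT1 dominates the area. This matches the constraint a ≥ (1/2) log_N(log n) = 1/2 for these parameters.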

Case  B  : 

This case is slightly more complicated than Case A. This is because when a is not
an integer, d/2 + a and d/2 - a are non-integers. Our scheme for Case A has to be
modified appropriately.


Fig. 7  Data block organization for Case B (panels 7a and 7b)


Let a = i + f, where i is an integer and f is a simple fraction. We can think
of the data as organized in a 2-D array of size N^{d/2} x N^{d/2}. The array can be
partitioned vertically into N^a blocks each of size N^{d/2} x N^{d/2-a} as shown in Fig. 7a.
If the blocks are placed one after the other, the data organization is transformed
into the form illustrated in Fig. 7b. It is to be noted that every row of the data of
Fig. 7a has been split into N^a parts, which occur in N^a different blocks of Fig. 7b.
Since the consecutiveness of the elements in a column is not lost in Fig. 7b,
computation of DFT(N) of the columns is straightforward. Each of the N^a blocks is
circulated d/2 times through a DFT(N) computation block and a ROT1 unit. The number
of DFT(N) circuits in each block is N^{d/2-1}. The rotator unit consists of N^a ROT1
units each of size O(N^d). Thus the total area occupied by the rotator unit is
N^a x O(N^d) = O(N^{d+a}).

The data is then passed through a transpose unit and a block rotator unit. The latter
is necessary in order that the consecutiveness of the elements in the rows of Fig. 7a
is transformed into the consecutiveness of the elements in the columns of Fig. 7b. We
shall illustrate the procedure now. The data block of Fig. 8a is partitioned into
N^{2a} subblocks each of size N^{d/2-a} x N^{d/2-a}. These subblocks are arranged as
shown in Fig. 8b. The transpose unit consists of N^{2a} subunits. Each transpose
subunit operates on a data subblock. The area of the transpose unit is
N^{2a} O(N^{d-2a} log n) = O(N^d log n). Since the subunits operate in parallel, the
pipeline time complexity is O(N^{d/2-a} log n). After transpose, though the rows and
columns of a subblock are interchanged, the subblock ordering remains the same. The
data is then passed through a block rotator unit BROT. Fig. 8c illustrates the
ordering of the subblocks after passing through BROT. The function of BROT is to
rotate a data subblock such that the relative positions of the elements in the
subblock do not change. The layout of BROT is similar but not identical to ROT1.
Since BROT rotates the subblocks only once, every necklace in the layout of BROT
would have at most 2 elements. In fact, all the elements of N^{2a} - N^a subblocks form
necklaces with one other element and all the rest form necklaces with themselves. The
total number of columns required to rotate the subblocks is O(N^{d/2+a}). The area of
BROT is thus O(N^{d+2a}). Fig. 9 illustrates the layout of BROT with 4 subblocks and 4
elements/subblock. The block diagram for the design for Case B is shown in Fig. 10.

Fig. 8a  Partitioning data of size n = N^d into N^{2a} subblocks
Fig. 8b  Subblock organization
Fig. 8c  Reordering of subblocks after passing through BROT

Fig. 9  Layout of BROT with 4 subblocks and 4 elements/subblock
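The interplay of the transpose unit and BROT can be illustrated on a small square array. The sketch below rests on an interpretation that the text does not spell out: BROT is modelled as exchanging subblock (p, q) with subblock (q, p) of the N^{d/2} x N^{d/2} array while leaving the interior of each subblock untouched. This reading is consistent with the statement that N^{2a} - N^a subblocks pair up and the remaining N^a stay in place, but the function names and the square layout are assumptions of this example.

```python
def transpose_each_subblock(grid, c):
    """Transpose every c x c subblock of a square array without moving subblocks."""
    side = len(grid)
    out = [[None] * side for _ in range(side)]
    for r in range(side):
        for col in range(side):
            br, bc = r // c, col // c      # subblock coordinates
            ir, ic = r % c, col % c        # position inside the subblock
            out[br * c + ic][bc * c + ir] = grid[r][col]
    return out


def block_rotate(grid, c):
    """Swap subblock (p, q) with subblock (q, p); interior order is preserved."""
    side = len(grid)
    out = [[None] * side for _ in range(side)]
    for r in range(side):
        for col in range(side):
            br, bc = r // c, col // c
            ir, ic = r % c, col % c
            out[bc * c + ir][br * c + ic] = grid[r][col]
    return out


if __name__ == "__main__":
    # 4 subblocks with 4 elements per subblock, as in the Fig. 9 caption.
    grid = [[4 * r + col for col in range(4)] for r in range(4)]
    for row in block_rotate(transpose_each_subblock(grid, 2), 2):
        print(row)
```

Under this model the two steps together give the full transpose of the array, so every original row becomes a column. With B subblocks per side, the B diagonal subblocks stay fixed and the other B^2 - B exchange in pairs.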


Fig. 10  Architecture for Case B


The area of this design is dominated by the block rotator unit with area O(N^{d+2a})
for (1/2) log_N(log n) ≤ a ≤ d/2. The maximum pipeline time is the time taken to
transpose the data, T_pmax = O(N^{d/2-a} log n). Thus AT^2 = O(N^{2d} log^2 n) =
O(n^2 log^2 n) for all a such that (1/2) log_N(log n) ≤ a ≤ d/2 and N^a is an integer.

An interesting issue is the size of this family of optimal architectures. This
happens to be a function of the number of values a can take. For N = 2^m, the values
that f can take are {0, 1/m, ..., (m-1)/m} and the values that i can take are
i_0 ≤ i ≤ d/2, where i_0 = ceil((1/2) log_N(log n)). Let f_0 be the value of f which is
greater than or equal to, and closest to, (1/2) log_N(log n). The number of possible
values of a is then m(d/2 - i_0 - f_0 + 1) + 1.
This is also the number of members of the family of optimal architectures.
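The size of the family can also be obtained by direct enumeration. The sketch below is an illustration under assumptions: logarithms are taken to base 2, N = 2^m, the admissible values are taken to be all a = k/m with (1/2) log_N(log n) ≤ a ≤ d/2 (so that N^a = 2^k is an integer), and the function name is hypothetical.

```python
import math
from fractions import Fraction


def family_members(m, d):
    """List the admissible a for N = 2**m and n = N**d.

    a = k/m keeps N**a = 2**k an integer; the range enforced is
    (1/2) * log_N(log2 n) <= a <= d/2.
    """
    log_n = m * d                                # log2(n)
    lower = 0.5 * math.log2(log_n) / m           # (1/2) * log_N(log2 n)
    k_min = math.ceil(lower * m - 1e-12)         # smallest k with k/m >= lower
    k_max = (m * d) // 2                         # largest k with k/m <= d/2
    return [Fraction(k, m) for k in range(max(k_min, 1), k_max + 1)]


if __name__ == "__main__":
    members = family_members(4, 4)               # N = 16, d = 4
    print(len(members), [str(a) for a in members])
```

For N = 16 and d = 4 the enumeration gives the seven values a = 1/2, 3/4, 1, 5/4, 3/2, 7/4 and 2.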

5  Applications  to  specific  transforms 

In this section, we apply the method developed in Sec. 4 to design architectures
for some specific transforms like 2-D DFT, DHT and DCT. If the input data is in a
2-D array of size N x N, then a straightforward way is to compute the 1-D transform on
the columns followed by the 1-D transform on the rows. It is well known that the
efficient way of computing a 1-D transform of N points is to map it into a 2-D
transform of N_1 x N_2 = N points and then compute the 1-D transforms over N_1 and N_2
points. So a 2-D transform over N x N points would be mapped into a 4-D transform over
N_1 x N_2 x N_3 x N_4, where N_1 N_2 = N_3 N_4 = N and N_1 = N_3; N_2 = N_4. It is to be
noted that the size of the family of optimal architectures does not change because
of the 2-D to 4-D conversion. The bounds for T_p are still [O(log n), O(sqrt(n log n))]
though the efficiency of the 1-D computation increases. Another point of interest
is the sequence of the dimensions over which the transforms can be evaluated. In
the case of 4-D DFT, the sequence is not at all important. DFT can be evaluated
over the 3rd and 1st dimension followed by the 4th and 2nd dimension. In the case
of 4-D DHT or DCT, evaluation over the 1st (3rd) dimension has to be followed by
evaluation over the 2nd (4th) dimension. This is because whereas evaluation of DFT
over one dimension is independent of the evaluation over the others, it is not so in
DHT and DCT [CJ].
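The mapping of a 1-D transform of N = N_1 N_2 points into two dimensions can be sketched for the DFT. The paper does not state which index mapping it uses; the example below assumes the standard Cooley-Tukey mapping with twiddle factors (input index n = N_2 n_1 + n_2, output index k = k_1 + N_1 k_2), and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used both as the short transform and as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def dft_via_2d(x, N1, N2):
    """DFT of N = N1*N2 points via N2 DFTs of length N1 and N1 DFTs of length N2."""
    N = N1 * N2
    if len(x) != N:
        raise ValueError("len(x) must equal N1 * N2")
    # Input index n = N2*n1 + n2: for each n2, transform the stride-N2 samples.
    inner = [dft([x[N2 * n1 + n2] for n1 in range(N1)]) for n2 in range(N2)]
    X = [0j] * N
    for k1 in range(N1):
        # Twiddle factors couple the two passes when N1 and N2 share factors.
        row = [inner[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
               for n2 in range(N2)]
        outer = dft(row)
        for k2 in range(N2):
            X[k1 + N1 * k2] = outer[k2]      # output index k = k1 + N1*k2
    return X
```

With N = 16 and N_1 = N_2 = 4, dft_via_2d agrees with the direct DFT to rounding error while using only length-4 transforms plus one twiddle multiplication per point. The twiddle step is one example of an adjustment between the two passes; the corresponding adjustments for DHT and DCT are not shown here.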

The 2-D N x N input array (as shown in Fig. 11a) is decomposed into subblocks
of size N^{1-a} x N^{1-a} and rearranged into an array of size N^{1+a} x N^{1-a} as shown
in Fig. 11b. Instead of referring to DFT, DHT and DCT separately, we refer to this
group of transforms as DXT. Fig. 12 illustrates our scheme for computing DXT. The
data is first passed through N^{1+a}/N_1 = N_2 N^a DXT(N_1) computation units. The
rotator unit consists of N^a ROT1 units, each of area O(N^2). The rotated data is
passed through N^{1+a}/N_2 = N_1 N^a DXT(N_2) computation units. In case of DHT or
DCT, the second computation unit takes care of any adjustments that have to be
made. The data is then passed through the transpose and the block rotator units.
Since N_3 = N_1 and N_4 = N_2, the data is circulated through the same DXT(N_1),


Fig. 11a  Partitioning data of size n = N^2 into N^{2a} subblocks
Fig. 11b  Subblock organization

In our scheme, there exists a family of optimal architectures with area O(N^{2+2a})
and pipeline time O(N^{1-a} log n), (1/2) log_N(log n) ≤ a ≤ 1. If we modify our design of
the rotator unit such that it consists of one ROT1 unit with N^{1+a} input lines and
O(N^{2+2a}) area, and if a is chosen to be log_N N_2, then we can do away with the BROT
unit. For most cases a = log_N N_2 is greater than the lower bound of a. For this value
of a, N^{1-a} = N_1 = N_3 and DXT(N_3) can be computed directly after passing through
the transpose unit. This is a considerable gain in terms of absolute area. We would
refer to this design as the most practical of all optimal designs for 2-D DXT.

Fig. 12  Architecture for 2-D DXT
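The choice a = log_N N_2 can be checked against the lower bound on a for given factor sizes. The sketch below is illustrative: it assumes N_1 = 2^{m1} and N_2 = 2^{m2}, logarithms to base 2, and a hypothetical function name.

```python
import math
from fractions import Fraction


def practical_a(m1, m2):
    """For N1 = 2**m1, N2 = 2**m2 and N = N1*N2, return (a, lower bound, feasible).

    a = log_N(N2) = m2 / (m1 + m2); the lower bound is (1/2) * log_N(log2 n)
    with n = N**2.
    """
    m = m1 + m2
    a = Fraction(m2, m)
    log_n = 2 * m                             # log2(n) for n = N**2
    lower = 0.5 * math.log2(log_n) / m
    # With this a, N^{1-a} = 2^{m(1-a)} = 2^{m1} = N1, as the text states.
    assert m * (1 - a) == m1
    return a, lower, a >= lower


if __name__ == "__main__":
    print(practical_a(2, 2))   # N = 16, N1 = N2 = 4
    print(practical_a(3, 1))   # N = 16, N1 = 8, N2 = 2
```

For N = 16 with N_1 = N_2 = 4 this gives a = 1/2 against a lower bound of 0.375, so the choice is admissible. With N_1 = 8 and N_2 = 2 it gives a = 1/4, which falls below the bound, an instance of the exceptions implied by "for most cases".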


6  Conclusion 

In this paper we have proposed a family of optimal architectures for computing
multidimensional transforms such as DFT, DHT and DCT. We have
discussed the computation of multidimensional DFT in detail. The architectures for
multidimensional DHT and DCT are very similar, the only difference being in the
computation block of the 1-D transform. Since we know how to optimally compute
1-D DHT and DCT [CJ], optimal computation of multidimensional DHT and DCT
is no different from that of DFT.

Gertner and Shamash [GS] have proposed an optimal architecture for computing
multidimensional DFT on a single file of n = N^d elements. The design consists of
N^{d-1} DFT(N) computation arrays and a rotator for permuting the data. The rotator
with an area of O(n^2) dominates the area of the design. The pipeline time complexity
is O(log n). Thus this design achieves the AT^2 lower bound. We have shown that
if the input is in the form of a 2-D array of size N^{d/2+a} x N^{d/2-a}, 0 < a ≤ d/2, we
can design a family of optimal architectures for different values of a. The design of
[GS] is a member of this family with a = d/2. We have studied two cases, A when
a is an integer and B when a is not an integer but N^a is an integer. The design for
Case A consists of N^{d/2+a-1} DFT(N) computation arrays, a transpose unit of size
O(N^d log n) and a rotator of size O(N^{d+2a}). T_pmax for this design is
O(N^{d/2-a} log n). Case B is slightly more complicated and consists of an additional
block rotator unit BROT with area O(N^{d+2a}). For all these designs the lower bound
of AT^2 is achieved.

The size of this family of optimal architectures is a function of the number of
values a can take. For N = 16 and d = 4, the number of members of this family
is 7. We choose a particular member depending on whether minimization of area or
minimization of pipeline time is more important. A design with low values of a would
have less area, whereas a design with large values of a would require less time.

We know that the efficient way of computing a 1-D linear transform over N points
is to map it into a 2-D transform of N = N_1 N_2 points and then compute a 1-D transform
over each dimension. Similarly, an efficient way of computing a 2-D transform of N x N
points would be to map it into a 4-D transform and then compute a 1-D transform
over each dimension. We have designed optimal architectures for computing such
transforms. The most practical design in this family of optimal architectures occurs
when a is chosen to be log_N N_2. In this case even though a is not an integer, we do
not need the block rotator unit. This is a considerable saving in terms of absolute
area.